
Search in the Catalogues and Directories







Hits 1 – 20 of 63

1	Assessing the impact of OCR noise on multilingual event detection over digitised documents
	Boros, Emanuela; Nguyen, Nhu Khoa; Lejeune, Gaël...
	In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
	BASE

2	Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
	Boros, Emanuela; Nguyen, Nhu Khoa; Lejeune, Gaël. - : Zenodo, 2022
	BASE

3	Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
	Boros, Emanuela; Nguyen, Nhu Khoa; Lejeune, Gaël. - : Zenodo, 2022
	BASE

4	L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers
	Nguyen, Nhu Khoa; Boros, Emanuela; Lejeune, Gaël...
	In: WWW '21: Companion Proceedings of the Web Conference 2021 ; WWW '21: The Web Conference 2021 ; https://hal.sorbonne-universite.fr/hal-03256324 ; WWW '21: The Web Conference 2021, Apr 2021, Ljubljana (virtual), Slovenia. pp.302-306, ⟨10.1145/3442442.3451384⟩ (2021)
	BASE

5	Discovering Spatial Relations in Litterature: what is the influence of OCR noise ?
	Parfait, Caroline; Lejeune, Gaël; Alrahabi, Motasem; Roe, Glenn
	In: NewsEye’s international conference ; https://hal.archives-ouvertes.fr/hal-03199729 ; NewsEye’s international conference, Mar 2021, Paris, France (2021)
	Abstract: Digital Humanities methods enable the exploration and exploitation of digitized corpora at unprecedented scales. They also allow for refined research at several levels of granularity, from syntactic or hermeneutic perspectives, or through the identification of geographical named entities, which allows us to observe the evolution of language and its territorial distribution. However, Named Entity Recognition (NER) tools show notable performance limitations in humanities research because of the variability of the input data (linguistic, diachronic, diatopic variability). This lack of robustness to variation is particularly striking when dealing with literary corpora, even more so when it involves early modern texts. The correct recognition of named entities depends on the training of the language model implemented in the NER system. Language models are usually trained on so-called “clean data”, assembled under optimal laboratory conditions, and for application to a specific corpus, which limits their generalizability to other corpora. Moreover, language models for early modern texts often require access to large corpora which have previously been transcribed using OCR. The quality of these transcriptions remains the subject of many current research projects [Baledent et al., 2020]. In essence, the malfunctioning of NER tools is attributed, on the one hand, to the quality of the transcriptions provided as input and, on the other hand, to the fact that the corpus being processed does not correspond to the corpus on which the language model was trained. To overcome the problem of OCR transcript quality, users resort to a strategy that is costly in both time and money: cleaning the transcribed text.
Indeed, any number of errors can exist in OCR transcriptions [Stanislawek et al., 2019], and this search for perfection, though perhaps feasible on very small corpora, can be never-ending and represents a considerable expenditure of time at larger scales. Our project seeks to evaluate out-of-the-box NER tools, specifically spaCy, on minimally corrected OCR transcriptions. This experiment should allow us to see the capacity of these tools to do their work outside of ideal laboratory conditions, aiming to get closer to a more everyday use of these tools, i.e. that of a user who has neither the time nor the money for corrections, but nevertheless seeks actionable results. Given this tension between the ideal and the real, we have for the moment eschewed any ground truth, which is costly to produce. Nevertheless, we use what we consider to be a reference text. The reference texts are extracted from ELTeC, a multilingual European Literary Text Collection in which entire novels are available in a standardized version. The texts we use in hypothesis-testing consist of the OCR transcription of the same texts, downloaded in PDF format from the Gallica website. The first novel on which we focus is Marguerite Audoux’s Marie-Claire (1910), a novel of about 34,500 words. We carried out initial tests on short text extracts of about words and found that the pre-trained spaCy models are capable of recognising a number of terms even when roughly transcribed by the OCR tool. The “fr_core_news_sm” model finds 79% of entities present in both the reference and the hypothesis text, and 12.5% of entities which are incorrectly spelled in the hypothesis text.
	Keyword: [SHS]Humanities and Social Sciences
	URL: https://hal.archives-ouvertes.fr/hal-03199729
	BASE
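	The abstract of record 5 reports two proportions: entities found identically in the reference and the OCR (hypothesis) text, and entities found only in a misspelled form. The abstract does not say how these were computed, so the following is only a minimal sketch under stated assumptions: entity strings are assumed to have been extracted beforehand (for instance by running an NER model on each text), misspellings are approximated with a similarity threshold, and the function names, the threshold of 0.8 and the toy entity lists are all hypothetical.

```python
from difflib import SequenceMatcher


def compare_entities(reference, hypothesis, threshold=0.8):
    """Sort reference entities into exact, near (likely misspelled) and missed."""
    hyp_set = set(hypothesis)
    exact, near, missed = [], [], []
    for ent in sorted(set(reference)):
        if ent in hyp_set:
            exact.append(ent)
            continue
        # Best character-level similarity against any hypothesis entity.
        best = max(
            (SequenceMatcher(None, ent, h).ratio() for h in hyp_set),
            default=0.0,
        )
        (near if best >= threshold else missed).append(ent)
    return exact, near, missed


def shares(reference, hypothesis, threshold=0.8):
    """Return (share of exact matches, share of near matches)."""
    exact, near, missed = compare_entities(reference, hypothesis, threshold)
    total = len(exact) + len(near) + len(missed)
    if total == 0:
        return 0.0, 0.0
    return len(exact) / total, len(near) / total


# Toy data, invented for illustration only (not taken from the study).
reference_entities = ["Marie-Claire", "Paris", "Sologne", "Henri"]
ocr_entities = ["Marie-Claire", "Paris", "Solognc"]

print(shares(reference_entities, ocr_entities))
```

	The threshold decides what counts as a misspelling rather than a miss, so any figure obtained this way depends on that choice and is not comparable to the reported percentages without knowing the original procedure.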

6	Multilingual Epidemic Event Extraction
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine...
	In: Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings ; https://hal.archives-ouvertes.fr/hal-03480551 ; Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.139-156, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_12⟩ (2021)
	BASE

7	Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine...
	In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320343 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtuel), France (2021)
	BASE

8	« Exploiter un corpus de données textuelles sans post-traitement : l’écriture burlesque de la Fronde »
	Abiven, Karine; Lejeune, Gaël; Tanguy, Jean-Baptiste
	In: ISSN: 2736-2337 ; Humanités numériques ; https://hal.archives-ouvertes.fr/hal-03500616 ; Humanités numériques, Bruxelles: Humanistica, 2021 (2021)
	BASE

9	Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

10	Multilingual Epidemic Event Extraction ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

11	Impact Analysis of Document Digitization on Event Extraction ...
	Nguyen, Nhu Khoa; Boroş, Emanuela; Lejeune, Gaël. - : Zenodo, 2021
	BASE

12	Token-level Multilingual Epidemic Dataset for Event Extraction ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

13	Impact Analysis of Document Digitization on Event Extraction ...
	Nguyen, Nhu Khoa; Boroş, Emanuela; Lejeune, Gaël. - : Zenodo, 2021
	BASE

14	Token-level Multilingual Epidemic Dataset for Event Extraction ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

15	Multilingual Epidemic Event Extraction ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

16	Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

17	Multilingual Epidemiological Text Classification: A Comparative Study ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

18	Multilingual Epidemiological Text Classification: A Comparative Study ...
	Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine. - : Zenodo, 2021
	BASE

19	SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
	Ortiz Suárez, Pedro Javier; Dupont, Yoann; Lejeune, Gaël...
	In: CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum ; https://hal.inria.fr/hal-02984746 ; CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Sep 2020, Thessaloniki / Virtual, Greece ; https://impresso.github.io/CLEF-HIPE-2020/ (2020)
	BASE

20	Daniel@FinTOC’2 Shared Task: Title Detection and Structure Extraction
	Giguet, Emmanuel; Lejeune, Gaël; Tanguy, Jean-Baptiste
	In: 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020 ; https://hal.archives-ouvertes.fr/hal-03024867 ; 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020, Dec 2020, Barcelone, Spain (2020)
	BASE
